Running R at Scale with Apache Arrow on Spark

Javier Luraschi

Spark Summit 2019

Overview

  • Intro to R
  • R with Spark
  • Intro to Arrow
  • Arrow on Spark

Intro to R

R Language

CRAN Packages

R with Spark

sparklyr 0.4 - Initial Release

sparklyr 0.5 - Connections

sparklyr 0.6 - Distributed R

sparklyr 0.7 - Pipelines

sparklyr 0.8 - MLeap and Graphs

sparklyr 0.9 - Streams

sparklyr 1.0 - Arrow

  • Arrow enables faster and larger data transfers between Spark and R.
  • XGBoost enables training gradient boosting models over distributed datasets.
  • Broom converts Spark’s models into tidy formats that you know and love.
  • TFRecords writes TensorFlow records from Spark to support deep learning workflows.

Intro to Arrow

What is Arrow?

Arrow is a cross-language development platform for in-memory data.

The Feather project

The Ursa project

The Arrow R package
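As a minimal sketch of what the arrow R package does on its own, without Spark, the snippet below writes a small data frame to a Feather file and reads it back. The data frame and the temporary file path are made up for illustration; `write_feather()` and `read_feather()` are functions from the arrow package.

```r
library(arrow)

# Hypothetical sample data, used only for illustration.
df <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"))

# Write the data frame to a temporary Feather file (Arrow's on-disk format).
path <- tempfile(fileext = ".feather")
write_feather(df, path)

# Read it back into R; column types are preserved by the Arrow format.
roundtrip <- read_feather(path)
print(roundtrip)
```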

Arrow on Spark

Install Arrow
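A minimal sketch of the install step, assuming the arrow package is installed from CRAN. The talk may have used a development build instead, so treat the exact source as an assumption.

```r
# Install the arrow R package from CRAN (assumed source; a development
# build from GitHub may have been required at the time of the talk).
install.packages("arrow")
```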

Then use Arrow with Spark and R:
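A sketch of the setup, assuming a local Spark installation is available to sparklyr (for example via `spark_install()`). Attaching arrow alongside sparklyr is what lets sparklyr use Arrow for data transfers; the local master is an assumption for illustration.

```r
library(sparklyr)
library(arrow)  # once attached, sparklyr can use Arrow for transfers

# Connect to a local Spark instance (assumed to be installed already).
sc <- spark_connect(master = "local")

# Check that the connection is usable, then close it.
is_open <- connection_is_open(sc)
print(is_open)
spark_disconnect(sc)
```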

Copy with Arrow
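A hedged sketch of copying data from R into Spark with arrow attached. The table name `numbers` and the row count are illustrative assumptions, and the block opens its own local connection so it can run independently.

```r
library(sparklyr)
library(arrow)

sc <- spark_connect(master = "local")

# Hypothetical local data frame to send to Spark.
local_df <- data.frame(y = 1:100000)

# copy_to() transfers the data frame into Spark; with arrow attached,
# sparklyr can serialize it in Arrow's columnar format.
numbers_tbl <- copy_to(sc, local_df, "numbers", overwrite = TRUE)

# Count the rows on the Spark side before disconnecting.
n_rows <- sdf_nrow(numbers_tbl)
print(n_rows)

spark_disconnect(sc)
```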

Collect with Arrow
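A sketch of the reverse direction: collecting a Spark DataFrame into R. `sdf_len()` generates a Spark DataFrame with a single `id` column; the size is an arbitrary choice for illustration, and the local connection is an assumption.

```r
library(sparklyr)
library(arrow)
library(dplyr)

sc <- spark_connect(master = "local")

# Generate rows in Spark, then pull them into a local R data frame.
# With arrow attached, the transfer can use Arrow's columnar format.
collected <- sdf_len(sc, 100000) %>% collect()
print(nrow(collected))

spark_disconnect(sc)
```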

Transform with Arrow
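A sketch of running R code over a Spark DataFrame with `spark_apply()`. The halving function is a made-up example; each partition arrives in the R worker as a data frame, and with arrow attached the data can be exchanged between Spark and the workers in Arrow's format.

```r
library(sparklyr)
library(arrow)
library(dplyr)

sc <- spark_connect(master = "local")

# Apply an R function to each partition of a Spark DataFrame.
# The function receives a data frame and must return a data frame.
halved <- sdf_len(sc, 1000) %>%
  spark_apply(function(df) df / 2) %>%
  collect()

# Row order is not guaranteed, so summarize rather than inspect rows.
total <- sum(halved$id)
print(total)

spark_disconnect(sc)
```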

Thank you!

Resources

  • Docs: spark.rstudio.com
  • GitHub: github.com/rstudio/sparklyr
  • Blog: blog.rstudio.com/tags/sparklyr
  • R Help: community.rstudio.com
  • Spark Help: stackoverflow.com/tags/sparklyr
  • Issues: github.com/rstudio/sparklyr/issues
  • Chat: gitter.im/rstudio/sparklyr